[CodeLayout] Faster basic block reordering, ext-tsp #68617


Merged: 5 commits into llvm:main on Oct 25, 2023

Conversation

@spupyrev spupyrev commented Oct 9, 2023

Aggressive inlining might produce huge functions with >10K basic blocks.
Since BFI treats all blocks and jumps as "hot", each having a non-negative
(but perhaps small) weight, the current implementation can be slow, taking
minutes to produce a layout. This change introduces a few modifications that
significantly (up to 50x on some instances) speed up the computation. Some
notable changes:

  • reduced the maximum chain size to 512 (from the prior 4096);
  • introduced a MaxMergeDensityRatio param to avoid merging chains with
    very different densities (a sketch of the check follows this list);
  • dropped a couple of params that seem unnecessary.
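
For illustration, here is a minimal standalone sketch of the gate that the new
MaxMergeDensityRatio param controls (it mirrors the hunk added to
CodeLayout.cpp in the patch below; the free-standing function and its name are
just for this example):

  #include <algorithm>
  #include <cassert>

  // Sketch of the check behind -ext-tsp-max-merge-density-ratio: two chains
  // are considered for merging only if their execution densities are within
  // the given ratio of each other (the patch sets the default to 1000).
  static bool densitiesAllowMerging(double DensityA, double DensityB,
                                    double MaxMergeDensityRatio = 1000.0) {
    assert(DensityA > 0.0 && DensityB > 0.0 &&
           "incorrectly computed chain densities");
    const double MinDensity = std::min(DensityA, DensityB);
    const double MaxDensity = std::max(DensityA, DensityB);
    return MaxDensity / MinDensity <= MaxMergeDensityRatio;
  }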

Looking at some "offline" metrics (e.g., the number of created
fall-throughs), there shouldn't be any problems; in fact, I do see some
metrics go up. But it might be hard/impossible to measure a perf difference
for such small changes. I did test the performance of a clang-14 binary and
did not record any perf or i-cache-related differences.

My 6 benchmarks, with the ext-tsp running time (lower is better) and the
"tsp-score" (higher is better).
Before:

  • benchmark 1:
    num functions: 13,047
    reordering running time is 2.4 seconds
    score: 125503458 (128.3102%)
  • benchmark 2:
    num functions: 16,438
    reordering running time is 3.4 seconds
    score: 12613997277 (129.7495%)
  • benchmark 3:
    num functions: 12,359
    reordering running time is 1.9 seconds
    score: 1315881613 (105.8991%)
  • benchmark 4:
    num functions: 96,588
    reordering running time is 7.3 seconds
    score: 89513906284 (100.3413%)
  • benchmark 5:
    num functions: 1
    reordering running time is 372 seconds
    score: 21292505965077 (99.9979%)
  • benchmark 6:
    num functions: 71,155
    reordering running time is 314 seconds
    score: 29795381626270671437824 (102.7519%)

After:

  • benchmark 1:
    reordering running time is 2.2 seconds
    score: 125510418 (128.3130%)

  • benchmark 2:
    reordering running time is 2.6 seconds
    score: 12614502162 (129.7525%)

  • benchmark 3:
    reordering running time is 1.6 seconds
    score: 1315938168 (105.9024%)

  • benchmark 4:
    reordering running time is 4.9 seconds
    score: 89518095837 (100.3454%)

  • benchmark 5:
    reordering running time is 4.8 seconds
    score: 21292295939119 (99.9971%)

  • benchmark 6:
    reordering running time is 104 seconds
    score: 29796710925310302879744 (102.7565%)

@spupyrev spupyrev marked this pull request as ready for review October 9, 2023 20:00
llvmbot commented Oct 9, 2023

@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-transforms

Changes

Aggressive inlining might produce huge functions with >10K basic blocks.
Since BFI treats all blocks and jumps as "hot", each having a non-negative
(but perhaps small) weight, the current implementation can be slow, taking
minutes to produce a layout. This change introduces a few modifications that
significantly (up to 50x on some instances) speed up the computation. Some
notable changes:

  • reduced the maximum chain size to 512 (from the prior 4096);
  • introduced MaxMergeDensityRatio param to avoid merging chains with
    very different densities;
  • dropped a couple of params that seem unnecessary.

Looking at some "offline" metrics (e.g., the number of created
fall-throughs), there shouldn't be any problems; in fact, I do see some
metrics go up. But it might be hard/impossible to measure a perf difference
for such small changes. I did test the performance of a clang-14 binary and
did not record any perf or i-cache-related differences.

My 5 benchmarks, with the ext-tsp running time (lower is better) and the
"tsp-score" (higher is better).
Before:

  • benchmark 1:
    reordering running time is 2486 milliseconds
    score: 125503458 (128.3102%)
  • benchmark 2:
    reordering running time is 3443 milliseconds
    score: 12613997277 (129.7495%)
  • benchmark 2:
    reordering running time is 1978 milliseconds
    score: 1315881613 (105.8991%)
  • benchmark 4:
    reordering running time is 7364 milliseconds
    score: 89513906284 (100.3413%)
  • benchmark 5:
    reordering running time is 372605 milliseconds
    score: 21292505965077 (99.9979%)

After:

  • benchmark 1:
    reordering running time is 2498 milliseconds
    score: 125510418 (128.3173%)

  • benchmark 2:
    reordering running time is 3201 milliseconds
    score: 12614502162 (129.7547%)

  • benchmark 3:
    reordering running time is 2137 milliseconds
    score: 1315938168 (105.9036%)

  • benchmark 4:
    reordering running time is 6242 milliseconds
    score: 89518095837 (100.3460%)

  • benchmark 5:
    reordering running time is 5819 milliseconds
    score: 21292295939119 (99.9969%)


Patch is 22.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/68617.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Utils/CodeLayout.cpp (+143-100)
  • (modified) llvm/test/CodeGen/X86/code_placement_ext_tsp.ll (+2-11)
  • (modified) llvm/test/CodeGen/X86/code_placement_ext_tsp_large.ll (+8-8)
diff --git a/llvm/lib/Transforms/Utils/CodeLayout.cpp b/llvm/lib/Transforms/Utils/CodeLayout.cpp
index 057a5e86c04aca1..a7e1393d693e9ef 100644
--- a/llvm/lib/Transforms/Utils/CodeLayout.cpp
+++ b/llvm/lib/Transforms/Utils/CodeLayout.cpp
@@ -99,22 +99,15 @@ static cl::opt<unsigned> BackwardDistance(
     cl::desc("The maximum distance (in bytes) of a backward jump for ExtTSP"));
 
 // The maximum size of a chain created by the algorithm. The size is bounded
-// so that the algorithm can efficiently process extremely large instance.
+// so that the algorithm can efficiently process extremely large instances.
 static cl::opt<unsigned>
-    MaxChainSize("ext-tsp-max-chain-size", cl::ReallyHidden, cl::init(4096),
-                 cl::desc("The maximum size of a chain to create."));
+    MaxChainSize("ext-tsp-max-chain-size", cl::ReallyHidden, cl::init(512),
+                 cl::desc("The maximum size of a chain to create"));
 
-// The maximum size of a chain for splitting. Larger values of the threshold
-// may yield better quality at the cost of worsen run-time.
-static cl::opt<unsigned> ChainSplitThreshold(
-    "ext-tsp-chain-split-threshold", cl::ReallyHidden, cl::init(128),
-    cl::desc("The maximum size of a chain to apply splitting"));
-
-// The option enables splitting (large) chains along in-coming and out-going
-// jumps. This typically results in a better quality.
-static cl::opt<bool> EnableChainSplitAlongJumps(
-    "ext-tsp-enable-chain-split-along-jumps", cl::ReallyHidden, cl::init(true),
-    cl::desc("The maximum size of a chain to apply splitting"));
+// The maximum ratio between densities of two chains for merging.
+static cl::opt<double> MaxMergeDensityRatio(
+    "ext-tsp-max-merge-density-ratio", cl::ReallyHidden, cl::init(1000),
+    cl::desc("The maximum ratio between densities of two chains for merging"));
 
 // Algorithm-specific options for CDS.
 static cl::opt<unsigned> CacheEntries("cds-cache-entries", cl::ReallyHidden,
@@ -217,11 +210,13 @@ struct NodeT {
   NodeT &operator=(const NodeT &) = delete;
   NodeT &operator=(NodeT &&) = default;
 
-  explicit NodeT(size_t Index, uint64_t Size, uint64_t EC)
-      : Index(Index), Size(Size), ExecutionCount(EC) {}
+  explicit NodeT(size_t Index, uint64_t Size, uint64_t Count)
+      : Index(Index), Size(Size), ExecutionCount(Count) {}
 
   bool isEntry() const { return Index == 0; }
 
+  // Check if Other is a successor of the node.
+  bool isSuccessor(const NodeT *Other) const;
   // The total execution count of outgoing jumps.
   uint64_t outCount() const;
 
@@ -442,6 +437,14 @@ struct ChainEdge {
   bool CacheValidBackward{false};
 };
 
+bool NodeT::isSuccessor(const NodeT *Other) const {
+  for (JumpT *Jump : OutJumps) {
+    if (Jump->Target == Other)
+      return true;
+  }
+  return false;
+}
+
 uint64_t NodeT::outCount() const {
   uint64_t Count = 0;
   for (JumpT *Jump : OutJumps)
@@ -475,14 +478,14 @@ void ChainT::mergeEdges(ChainT *Other) {
   }
 }
 
-using NodeIter = std::vector<NodeT *>::const_iterator;
+/// A wrapper around three concatenated vectors (chains) of objects; it is used
+/// to avoid extra instantiation of the vectors.
+template <typename ObjType> struct MergedVector {
+  using ObjIter = typename std::vector<ObjType *>::const_iterator;
 
-/// A wrapper around three chains of nodes; it is used to avoid extra
-/// instantiation of the vectors.
-struct MergedChain {
-  MergedChain(NodeIter Begin1, NodeIter End1, NodeIter Begin2 = NodeIter(),
-              NodeIter End2 = NodeIter(), NodeIter Begin3 = NodeIter(),
-              NodeIter End3 = NodeIter())
+  MergedVector(ObjIter Begin1, ObjIter End1, ObjIter Begin2 = ObjIter(),
+               ObjIter End2 = ObjIter(), ObjIter Begin3 = ObjIter(),
+               ObjIter End3 = ObjIter())
       : Begin1(Begin1), End1(End1), Begin2(Begin2), End2(End2), Begin3(Begin3),
         End3(End3) {}
 
@@ -507,13 +510,26 @@ struct MergedChain {
 
   const NodeT *getFirstNode() const { return *Begin1; }
 
+  bool empty() const { return Begin1 == End1; }
+
+  void append(ObjIter Begin, ObjIter End) {
+    if (Begin2 == End2) {
+      Begin2 = Begin;
+      End2 = End;
+      return;
+    }
+    assert(Begin3 == End3 && "cannot extend MergedVector");
+    Begin3 = Begin;
+    End3 = End;
+  }
+
 private:
-  NodeIter Begin1;
-  NodeIter End1;
-  NodeIter Begin2;
-  NodeIter End2;
-  NodeIter Begin3;
-  NodeIter End3;
+  ObjIter Begin1;
+  ObjIter End1;
+  ObjIter Begin2;
+  ObjIter End2;
+  ObjIter Begin3;
+  ObjIter End3;
 };
 
 /// Merge two chains of nodes respecting a given 'type' and 'offset'.
@@ -521,29 +537,29 @@ struct MergedChain {
 /// If MergeType == 0, then the result is a concatenation of two chains.
 /// Otherwise, the first chain is cut into two sub-chains at the offset,
 /// and merged using all possible ways of concatenating three chains.
-MergedChain mergeNodes(const std::vector<NodeT *> &X,
-                       const std::vector<NodeT *> &Y, size_t MergeOffset,
-                       MergeTypeT MergeType) {
+MergedVector<NodeT> mergeNodes(const std::vector<NodeT *> &X,
+                               const std::vector<NodeT *> &Y,
+                               size_t MergeOffset, MergeTypeT MergeType) {
   // Split the first chain, X, into X1 and X2.
-  NodeIter BeginX1 = X.begin();
-  NodeIter EndX1 = X.begin() + MergeOffset;
-  NodeIter BeginX2 = X.begin() + MergeOffset;
-  NodeIter EndX2 = X.end();
-  NodeIter BeginY = Y.begin();
-  NodeIter EndY = Y.end();
+  MergedVector<NodeT>::ObjIter BeginX1 = X.begin();
+  MergedVector<NodeT>::ObjIter EndX1 = X.begin() + MergeOffset;
+  MergedVector<NodeT>::ObjIter BeginX2 = X.begin() + MergeOffset;
+  MergedVector<NodeT>::ObjIter EndX2 = X.end();
+  MergedVector<NodeT>::ObjIter BeginY = Y.begin();
+  MergedVector<NodeT>::ObjIter EndY = Y.end();
 
   // Construct a new chain from the three existing ones.
   switch (MergeType) {
   case MergeTypeT::X_Y:
-    return MergedChain(BeginX1, EndX2, BeginY, EndY);
+    return MergedVector<NodeT>(BeginX1, EndX2, BeginY, EndY);
   case MergeTypeT::Y_X:
-    return MergedChain(BeginY, EndY, BeginX1, EndX2);
+    return MergedVector<NodeT>(BeginY, EndY, BeginX1, EndX2);
   case MergeTypeT::X1_Y_X2:
-    return MergedChain(BeginX1, EndX1, BeginY, EndY, BeginX2, EndX2);
+    return MergedVector<NodeT>(BeginX1, EndX1, BeginY, EndY, BeginX2, EndX2);
   case MergeTypeT::Y_X2_X1:
-    return MergedChain(BeginY, EndY, BeginX2, EndX2, BeginX1, EndX1);
+    return MergedVector<NodeT>(BeginY, EndY, BeginX2, EndX2, BeginX1, EndX1);
   case MergeTypeT::X2_X1_Y:
-    return MergedChain(BeginX2, EndX2, BeginX1, EndX1, BeginY, EndY);
+    return MergedVector<NodeT>(BeginX2, EndX2, BeginX1, EndX1, BeginY, EndY);
   }
   llvm_unreachable("unexpected chain merge type");
 }
@@ -618,6 +634,10 @@ class ExtTSPImpl {
     AllChains.reserve(NumNodes);
     HotChains.reserve(NumNodes);
     for (NodeT &Node : AllNodes) {
+      // Adjust execution counts.
+      Node.ExecutionCount = std::max(Node.ExecutionCount, Node.inCount());
+      Node.ExecutionCount = std::max(Node.ExecutionCount, Node.outCount());
+      // Create a chain.
       AllChains.emplace_back(Node.Index, &Node);
       Node.CurChain = &AllChains.back();
       if (Node.ExecutionCount > 0)
@@ -630,13 +650,13 @@ class ExtTSPImpl {
       for (JumpT *Jump : PredNode.OutJumps) {
         NodeT *SuccNode = Jump->Target;
         ChainEdge *CurEdge = PredNode.CurChain->getEdge(SuccNode->CurChain);
-        // this edge is already present in the graph.
+        // This edge is already present in the graph.
         if (CurEdge != nullptr) {
           assert(SuccNode->CurChain->getEdge(PredNode.CurChain) != nullptr);
           CurEdge->appendJump(Jump);
           continue;
         }
-        // this is a new edge.
+        // This is a new edge.
         AllEdges.emplace_back(Jump);
         PredNode.CurChain->addEdge(SuccNode->CurChain, &AllEdges.back());
         SuccNode->CurChain->addEdge(PredNode.CurChain, &AllEdges.back());
@@ -649,7 +669,7 @@ class ExtTSPImpl {
   /// to B are from A. Such nodes should be adjacent in the optimal ordering;
   /// the method finds and merges such pairs of nodes.
   void mergeForcedPairs() {
-    // Find fallthroughs based on edge weights.
+    // Find forced pairs of blocks.
     for (NodeT &Node : AllNodes) {
       if (SuccNodes[Node.Index].size() == 1 &&
           PredNodes[SuccNodes[Node.Index][0]].size() == 1 &&
@@ -699,28 +719,44 @@ class ExtTSPImpl {
     /// Deterministically compare pairs of chains.
     auto compareChainPairs = [](const ChainT *A1, const ChainT *B1,
                                 const ChainT *A2, const ChainT *B2) {
-      if (A1 != A2)
-        return A1->Id < A2->Id;
-      return B1->Id < B2->Id;
+      return std::make_tuple(A1->Id, B1->Id) < std::make_tuple(A2->Id, B2->Id);
     };
 
+    double PrevScore = 1e9;
     while (HotChains.size() > 1) {
       ChainT *BestChainPred = nullptr;
       ChainT *BestChainSucc = nullptr;
       MergeGainT BestGain;
       // Iterate over all pairs of chains.
       for (ChainT *ChainPred : HotChains) {
+        // Since the score of merging doesn't increase, we can stop early when
+        // the newly found merge is as good as the previous one.
+        if (BestGain.score() == PrevScore)
+          break;
         // Get candidates for merging with the current chain.
         for (const auto &[ChainSucc, Edge] : ChainPred->Edges) {
           // Ignore loop edges.
-          if (ChainPred == ChainSucc)
+          if (Edge->isSelfEdge())
             continue;
-
           // Stop early if the combined chain violates the maximum allowed size.
           if (ChainPred->numBlocks() + ChainSucc->numBlocks() >= MaxChainSize)
             continue;
+          // Don't merge the chains if they have vastly different densities.
+          // We stop early if the ratio between the densities exceeds
+          // MaxMergeDensityRatio. Smaller values of the option result in
+          // fewer merges (hence, more chains), which in turn typically yields
+          // smaller size of the hot code section.
+          double minDensity =
+              std::min(ChainPred->density(), ChainSucc->density());
+          double maxDensity =
+              std::max(ChainPred->density(), ChainSucc->density());
+          assert(minDensity > 0.0 && maxDensity > 0.0 &&
+                 "incorrectly computed chain densities");
+          const double Ratio = maxDensity / minDensity;
+          if (Ratio > MaxMergeDensityRatio)
+            continue;
 
-          // Compute the gain of merging the two chains.
+          // Compute the gain of merging the two chains
           MergeGainT CurGain = getBestMergeGain(ChainPred, ChainSucc, Edge);
           if (CurGain.score() <= EPS)
             continue;
@@ -732,6 +768,9 @@ class ExtTSPImpl {
             BestGain = CurGain;
             BestChainPred = ChainPred;
             BestChainSucc = ChainSucc;
+            // Stop early when the merge is as good as the previous one.
+            if (BestGain.score() == PrevScore)
+              break;
           }
         }
       }
@@ -741,6 +780,7 @@ class ExtTSPImpl {
         break;
 
       // Merge the best pair of chains.
+      PrevScore = BestGain.score();
       mergeChains(BestChainPred, BestChainSucc, BestGain.mergeOffset(),
                   BestGain.mergeType());
     }
@@ -769,24 +809,25 @@ class ExtTSPImpl {
   }
 
   /// Compute the Ext-TSP score for a given node order and a list of jumps.
-  double extTSPScore(const MergedChain &MergedBlocks,
-                     const std::vector<JumpT *> &Jumps) const {
-    if (Jumps.empty())
+  double extTSPScore(const MergedVector<NodeT> &Nodes,
+                     const MergedVector<JumpT> &Jumps) const {
+    if (Jumps.empty() || Nodes.empty())
       return 0.0;
+
     uint64_t CurAddr = 0;
-    MergedBlocks.forEach([&](const NodeT *Node) {
+    Nodes.forEach([&](const NodeT *Node) {
       Node->EstimatedAddr = CurAddr;
       CurAddr += Node->Size;
     });
 
     double Score = 0;
-    for (JumpT *Jump : Jumps) {
+    Jumps.forEach([&](const JumpT *Jump) {
       const NodeT *SrcBlock = Jump->Source;
       const NodeT *DstBlock = Jump->Target;
       Score += ::extTSPScore(SrcBlock->EstimatedAddr, SrcBlock->Size,
                              DstBlock->EstimatedAddr, Jump->ExecutionCount,
                              Jump->IsConditional);
-    }
+    });
     return Score;
   }
 
@@ -798,17 +839,15 @@ class ExtTSPImpl {
   /// element being the corresponding merging type.
   MergeGainT getBestMergeGain(ChainT *ChainPred, ChainT *ChainSucc,
                               ChainEdge *Edge) const {
-    if (Edge->hasCachedMergeGain(ChainPred, ChainSucc)) {
+    if (Edge->hasCachedMergeGain(ChainPred, ChainSucc))
       return Edge->getCachedMergeGain(ChainPred, ChainSucc);
-    }
 
     // Precompute jumps between ChainPred and ChainSucc.
-    auto Jumps = Edge->jumps();
-    ChainEdge *EdgePP = ChainPred->getEdge(ChainPred);
-    if (EdgePP != nullptr) {
-      Jumps.insert(Jumps.end(), EdgePP->jumps().begin(), EdgePP->jumps().end());
-    }
+    MergedVector<JumpT> Jumps(Edge->jumps().begin(), Edge->jumps().end());
     assert(!Jumps.empty() && "trying to merge chains w/o jumps");
+    ChainEdge *EdgePP = ChainPred->getEdge(ChainPred);
+    if (EdgePP != nullptr)
+      Jumps.append(EdgePP->jumps().begin(), EdgePP->jumps().end());
 
     // This object holds the best chosen gain of merging two chains.
     MergeGainT Gain = MergeGainT();
@@ -836,35 +875,36 @@ class ExtTSPImpl {
     Gain.updateIfLessThan(
         computeMergeGain(ChainPred, ChainSucc, Jumps, 0, MergeTypeT::X_Y));
 
-    if (EnableChainSplitAlongJumps) {
-      // Attach (a part of) ChainPred before the first node of ChainSucc.
-      for (JumpT *Jump : ChainSucc->Nodes.front()->InJumps) {
-        const NodeT *SrcBlock = Jump->Source;
-        if (SrcBlock->CurChain != ChainPred)
-          continue;
-        size_t Offset = SrcBlock->CurIndex + 1;
-        tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::X2_X1_Y});
-      }
+    // Attach (a part of) ChainPred before the first node of ChainSucc.
+    for (JumpT *Jump : ChainSucc->Nodes.front()->InJumps) {
+      const NodeT *SrcBlock = Jump->Source;
+      if (SrcBlock->CurChain != ChainPred)
+        continue;
+      size_t Offset = SrcBlock->CurIndex + 1;
+      tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::X2_X1_Y});
+    }
 
-      // Attach (a part of) ChainPred after the last node of ChainSucc.
-      for (JumpT *Jump : ChainSucc->Nodes.back()->OutJumps) {
-        const NodeT *DstBlock = Jump->Target;
-        if (DstBlock->CurChain != ChainPred)
-          continue;
-        size_t Offset = DstBlock->CurIndex;
-        tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::Y_X2_X1});
-      }
+    // Attach (a part of) ChainPred after the last node of ChainSucc.
+    for (JumpT *Jump : ChainSucc->Nodes.back()->OutJumps) {
+      const NodeT *DstBlock = Jump->Target;
+      if (DstBlock->CurChain != ChainPred)
+        continue;
+      size_t Offset = DstBlock->CurIndex;
+      tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::Y_X2_X1});
     }
 
     // Try to break ChainPred in various ways and concatenate with ChainSucc.
-    if (ChainPred->Nodes.size() <= ChainSplitThreshold) {
-      for (size_t Offset = 1; Offset < ChainPred->Nodes.size(); Offset++) {
-        // Try to split the chain in different ways. In practice, applying
-        // X2_Y_X1 merging is almost never provides benefits; thus, we exclude
-        // it from consideration to reduce the search space.
-        tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::Y_X2_X1,
-                                 MergeTypeT::X2_X1_Y});
-      }
+    // In practice, applying X2_Y_X1 merging is almost never provides benefits;
+    // thus, we exclude it from consideration to reduce the search space.
+    for (size_t Offset = 1; Offset < ChainPred->Nodes.size(); Offset++) {
+      // Do not split the chain along a jump.
+      const NodeT *BB = ChainPred->Nodes[Offset - 1];
+      const NodeT *BB2 = ChainPred->Nodes[Offset];
+      if (BB->isSuccessor(BB2))
+        continue;
+
+      tryChainMerging(Offset, {MergeTypeT::X1_Y_X2, MergeTypeT::Y_X2_X1,
+                               MergeTypeT::X2_X1_Y});
     }
     Edge->setCachedMergeGain(ChainPred, ChainSucc, Gain);
     return Gain;
@@ -875,19 +915,20 @@ class ExtTSPImpl {
   ///
   /// The two chains are not modified in the method.
   MergeGainT computeMergeGain(const ChainT *ChainPred, const ChainT *ChainSucc,
-                              const std::vector<JumpT *> &Jumps,
+                              const MergedVector<JumpT> &Jumps,
                               size_t MergeOffset, MergeTypeT MergeType) const {
-    auto MergedBlocks =
+    MergedVector<NodeT> MergedNodes =
         mergeNodes(ChainPred->Nodes, ChainSucc->Nodes, MergeOffset, MergeType);
 
     // Do not allow a merge that does not preserve the original entry point.
     if ((ChainPred->isEntry() || ChainSucc->isEntry()) &&
-        !MergedBlocks.getFirstNode()->isEntry())
+        !MergedNodes.getFirstNode()->isEntry())
       return MergeGainT();
 
     // The gain for the new chain.
-    auto NewGainScore = extTSPScore(MergedBlocks, Jumps) - ChainPred->Score;
-    return MergeGainT(NewGainScore, MergeOffset, MergeType);
+    double NewScore = extTSPScore(MergedNodes, Jumps);
+    double CurScore = ChainPred->Score;
+    return MergeGainT(NewScore - CurScore, MergeOffset, MergeType);
   }
 
   /// Merge chain From into chain Into, update the list of active chains,
@@ -897,7 +938,7 @@ class ExtTSPImpl {
     assert(Into != From && "a chain cannot be merged with itself");
 
     // Merge the nodes.
-    MergedChain MergedNodes =
+    MergedVector<NodeT> MergedNodes =
         mergeNodes(Into->Nodes, From->Nodes, MergeOffset, MergeType);
     Into->merge(From, MergedNodes.getNodes());
 
@@ -908,8 +949,10 @@ class ExtTSPImpl {
     // Update cached ext-tsp score for the new chain.
     ChainEdge *SelfEdge = Into->getEdge(Into);
     if (SelfEdge != nullptr) {
-      MergedNodes = MergedChain(Into->Nodes.begin(), Into->Nodes.end());
-      Into->Score = extTSPScore(MergedNodes, SelfEdge->jumps());
+      MergedNodes = MergedVector<NodeT>(Into->Nodes.begin(), Into->Nodes.end());
+      MergedVector<JumpT> MergedJumps(SelfEdge->jumps().begin(),
+                                      SelfEdge->jumps().end());
+      Into->Score = extTSPScore(MergedNodes, MergedJumps);
     }
 
     // Remove the chain from the list of active chains.
@@ -1255,7 +1298,7 @@ class CDSortImpl {
   }
 
   /// Compute the change of the distance locality after merging the chains.
-  double distBasedLocalityGain(const MergedChain &MergedBlocks,
+  double distBasedLocalityGain(const MergedVector<NodeT> &MergedBlocks,
                                const std::vector<JumpT *> &Jumps) const {
     if (Jumps.empty())
       return 0.0;
@@ -1283,7 +1326,7 @@ class CDSortImpl {
     assert(Into != From && "a chain cannot be merged with itself");
 
     // Merge the nodes.
-    MergedChain MergedNodes =
+    MergedVector<NodeT> MergedNodes =
         mergeNodes(Into->Nodes, From->Nodes, MergeOffset, MergeType);
     Into->merge(From, MergedNodes.getNodes());
 
diff --git a/llvm/test/CodeGen/X86/code_placement_ext_tsp.ll b/llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
index 4053b8a8e123b1c..be0b9820e145415 100644
--- a/llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
+++ b/llvm/test/CodeGen/X86/code_placement_ext_tsp.ll
@@ -1,6 +1,5 @@
 ;; See also llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp
 ; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 < %s | FileCheck %s
-; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 -ext-tsp-chain-split-threshold=0 -ext-tsp-enable-chain-split-along-jumps=0 < %s | FileCheck %s -check-prefix=CHECK2
 
 define void @func1a()  {
 ; Test that the algorithm positions the most likely successor first
@@ -329,8 +328,8 @@ end:
 }
 
 define void @func4() !prof !11 {
-; Test verifying that, if enabled, chains can be split in order to improve the
-; objective (by creating more fallthroughs)
+; Test verifying that chains can be split in order to improve the objective
+; by cr...
[truncated]

david-xl commented Oct 9, 2023

Curious: why is benchmark 3's runtime increasing (marked as benchmark 2 in Before)?

spupyrev commented Oct 9, 2023

Curious: why is benchmark 3's runtime increasing (marked as benchmark 2 in Before)?

I wasn't particularly careful with measuring build times here, as these are seconds, while the overall compilation takes minutes/hours. We should treat the "reordering runtime" here as given with +/- 10% precision, so the difference for "benchmark 3" (sorry for the typo) is likely just noise. In my mind, the times for benchmarks 1, 2, and 3 are unchanged, while there is a small but measurable (~1 sec) improvement for benchmark 4, and a large improvement for benchmark 5.
Let me know if you'd like to get higher precision.

// Do not split the chain along a jump.
const NodeT *BB = ChainPred->Nodes[Offset - 1];
const NodeT *BB2 = ChainPred->Nodes[Offset];
if (BB->isSuccessor(BB2))
Contributor:
What's the rationale for this? Sometimes a loop is rotated to improve overall ext-tsp score even though the chain is split along a jump.

Contributor Author:
This is purely for speeding up the algorithm. You're right that it results in (slightly) worse solutions, but we get a significant speedup.
I expanded the comment here: we can still rotate loops via one of the loops above; this check essentially prevents creating solutions with fewer (unweighted) fallthroughs.
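
For clarity, a self-contained restatement of the guard being discussed (the
Block type and helper below are illustrative only; the actual code in
CodeLayout.cpp operates on NodeT/ChainT):

  #include <cstddef>
  #include <vector>

  struct Block {
    std::vector<const Block *> Succs; // outgoing jump targets

    // Check whether Other is a jump target of this block.
    bool isSuccessor(const Block *Other) const {
      for (const Block *S : Succs)
        if (S == Other)
          return true;
      return false;
    }
  };

  // When enumerating split offsets of a chain, skip offsets that would
  // separate a block from one of its successors: splitting there breaks an
  // existing (unweighted) fallthrough, and skipping such offsets is what
  // buys the speedup.
  std::vector<std::size_t>
  candidateSplitOffsets(const std::vector<Block *> &Chain) {
    std::vector<std::size_t> Offsets;
    for (std::size_t Offset = 1; Offset < Chain.size(); ++Offset) {
      if (Chain[Offset - 1]->isSuccessor(Chain[Offset]))
        continue;
      Offsets.push_back(Offset);
    }
    return Offsets;
  }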

while (HotChains.size() > 1) {
ChainT *BestChainPred = nullptr;
ChainT *BestChainSucc = nullptr;
MergeGainT BestGain;
// Iterate over all pairs of chains.
for (ChainT *ChainPred : HotChains) {
// Since the score of merging doesn't increase, we can stop early when
Contributor:
Is this always true? What if merging two chains A and B allows the total gain of merging <A.B> with another chain C to increase because both A and B have branches to C?

Contributor Author:
Again, this is for speeding up the algorithm. The empirical speedup seems to be worth the quality regression.
Adjusted the comment.

@spupyrev spupyrev requested review from JDevlieghere, nikic and a team as code owners October 13, 2023 00:23
@spupyrev spupyrev marked this pull request as draft October 13, 2023 00:24
@spupyrev spupyrev closed this Oct 13, 2023
@spupyrev spupyrev removed request for a team, nikic and JDevlieghere October 13, 2023 00:26
@spupyrev spupyrev reopened this Oct 13, 2023
github-actions bot commented Oct 13, 2023

✅ With the latest revision this PR passed the Python code formatter.

github-actions bot commented Oct 13, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@spupyrev spupyrev marked this pull request as ready for review October 13, 2023 00:47
spupyrev commented Oct 13, 2023

After more thorough testing, I had to bring back ChainSplitThreshold and modify the tests accordingly, as otherwise some tests were up to 4x slower. Having this threshold regresses the quality a little bit, but I don't see a proper way of removing it.
I adjusted the numbers on my benchmarks to reflect the change.

@@ -725,6 +734,7 @@ class ExtTSPImpl {
return std::make_tuple(A1->Id, B1->Id) < std::make_tuple(A2->Id, B2->Id);
};

double PrevScore = 1e9;
Member:
This should use a sentinel value instead of a magic 1e9.

You can use std::optional<double> for an absent value.
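
A minimal sketch of that suggestion (the struct and member names below are
hypothetical; as the reply notes, the variable was ultimately removed from the
patch):

  #include <optional>

  // Track the score of the previously applied merge without a 1e9 sentinel;
  // an empty optional means no merge has been applied yet.
  struct MergeLoopState {
    std::optional<double> PrevScore;

    // Mirrors the early-exit in the patch, which assumes that merge scores
    // do not increase across iterations.
    bool canStopEarly(double BestScore) const {
      return PrevScore && BestScore == *PrevScore;
    }

    void recordMerge(double BestScore) { PrevScore = BestScore; }
  };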

Contributor Author:
removed the variable, as it seems to confuse people while the benefit is modest

@@ -1,6 +1,5 @@
;; See also llvm/unittests/Transforms/Utils/CodeLayoutTest.cpp
; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 < %s | FileCheck %s
; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 -ext-tsp-chain-split-threshold=0 -ext-tsp-enable-chain-split-along-jumps=0 < %s | FileCheck %s -check-prefix=CHECK2

Member:
Why is the -ext-tsp-chain-split-threshold test removed?

Contributor Author:
The test relies on another option, ext-tsp-enable-chain-split-along-jumps, which is now deleted.

@@ -753,15 +774,23 @@ class ExtTSPImpl {
BestGain = CurGain;
BestChainPred = ChainPred;
BestChainSucc = ChainSucc;
// Stop early when the merge is as good as the previous one.
@MaskRay (Member) Oct 13, 2023:
If PrevScore is changed to a std::optional, add an assert(!PrevScore || BestGain.score() <= *PrevScore + EPS)

@spupyrev

Are there any more comments or suggestions? Thanks

// Stop early if the combined chain violates the maximum allowed size.
if (ChainPred->numBlocks() + ChainSucc->numBlocks() >= MaxChainSize)
continue;
// Don't merge the chains if they have vastly different densities.
// We stop early if the ratio between the densities exceeds
Contributor:
Skip rather than stop?

// We stop early if the ratio between the densities exceeds
// MaxMergeDensityRatio. Smaller values of the option result in
// fewer merges (hence, more chains), which in turn typically yields
// smaller size of the hot code section.
Contributor:
I am confused by this comment. How can the hot section get smaller? Yes, we don't merge the chains, but they still get concatenated at the end.

Contributor Author:
I adjusted the comment (and the one above). Please check if the new one makes sense. Thanks

@@ -741,12 +750,23 @@ class ExtTSPImpl {
// Get candidates for merging with the current chain.
for (const auto &[ChainSucc, Edge] : ChainPred->Edges) {
// Ignore loop edges.
if (ChainPred == ChainSucc)
if (Edge->isSelfEdge())
Member:
Nit: This is often called a self-loop or loop. If you think it's fine, I'll push an NFC commit renaming isSelfEdge to isSelfLoop...

@spupyrev spupyrev merged commit cc2fbc6 into llvm:main Oct 25, 2023
@spupyrev spupyrev deleted the exttsp-faster-v2 branch November 16, 2023 18:50